Import Libraries

Load Data

Quick Glance at File Summaries

Check for Missing Values

Individual Feature Visualizations

Findings:

Plot 1: Visitors over time
Plot 2 & 3: Occurrences of Visitor Count // Occurrences of Max Visitor Count by Store ID
Plot 4: Median Visitors by Day of Week
Plot 5 & 6: Median Visitors by Month // Median Visitors by Month-Year

As we will be forecasting the last week of April into May of 2017, we can plot the corresponding timeframe from the 2016 training data provided. We'll extend the timeframe a little to have two months' worth of data (Apr 15-Jun 15 of 2016).
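As a minimal sketch of the date filter described above, assuming the visit data sits in a pandas DataFrame with `air_store_id`, `visit_date` and `visitors` columns (these column names and the tiny sample values are assumptions for illustration, not the real data):

```python
import pandas as pd

# Hypothetical stand-in for the AIR visit data; names and values are made up.
visits = pd.DataFrame({
    "air_store_id": ["a", "b", "a", "b", "a"],
    "visit_date": pd.to_datetime(
        ["2016-04-10", "2016-04-15", "2016-05-03", "2016-05-03", "2016-06-20"]
    ),
    "visitors": [10, 20, 5, 7, 30],
})

# Keep only Apr 15 - Jun 15 of 2016 (both endpoints inclusive).
start, end = pd.Timestamp("2016-04-15"), pd.Timestamp("2016-06-15")
in_window = visits["visit_date"].between(start, end)

# Total visitors per day across all stores, ready for a line plot.
daily_total = visits.loc[in_window].groupby("visit_date")["visitors"].sum()
print(daily_total)
```

Calling `daily_total.plot()` on the result would give the raw daily line for the window.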

Looking at this plot, we can see the blue line, which corresponds to the raw data (total visitors by day), and a pink line that approximates it with a smoothing fit. There is a drop relative to the overall pattern from May 1st to just before May 8th; this corresponds to the Golden Week holiday.
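The exact smoother behind the pink line is not stated here. One simple stand-in is a centered 7-day rolling mean, which averages out the weekly cycle so that dips such as Golden Week stand out. The sketch below uses made-up daily totals, and the window length is an assumption:

```python
import pandas as pd

# Made-up daily totals with a repeating weekly pattern (illustration only).
dates = pd.date_range("2016-04-15", periods=14, freq="D")
daily_total = pd.Series([100, 120, 150, 90, 95, 100, 110] * 2, index=dates)

# Centered 7-day rolling mean; min_periods=1 keeps the edges from becoming NaN.
smoothed = daily_total.rolling(window=7, center=True, min_periods=1).mean()

# Side-by-side view of the raw and smoothed series.
comparison = pd.DataFrame({"raw": daily_total, "smoothed": smoothed})
print(comparison.head())
```

A LOESS or spline fit would give a smoother curve. The rolling mean is used here only because it needs nothing beyond pandas.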

Now we can switch gears from the visitor data (true visits) to the reservation data. We will continue with the AIR data.

We can do a similar datetime visualization with the reservation dataset, as it includes a field that shows the visitor count on that same day. We can also take a look at the time difference between when the reservation was made and when the reservation was made for.
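The lead time between making a reservation and the visit itself can be sketched as a simple datetime subtraction. The column names (`visit_datetime`, `reserve_datetime`, `reserve_visitors`) and the sample rows below are assumptions for illustration:

```python
import pandas as pd

# Hypothetical stand-in for the AIR reservation data; values are made up.
reserve = pd.DataFrame({
    "air_store_id": ["a", "a", "b"],
    "visit_datetime": pd.to_datetime(
        ["2016-05-01 19:00", "2016-05-02 12:00", "2016-05-10 18:00"]
    ),
    "reserve_datetime": pd.to_datetime(
        ["2016-05-01 10:00", "2016-04-30 12:00", "2016-05-03 18:00"]
    ),
    "reserve_visitors": [2, 4, 6],
})

# Time between when the reservation was made and when it is for.
lead = reserve["visit_datetime"] - reserve["reserve_datetime"]
reserve["lead_hours"] = lead.dt.total_seconds() / 3600
reserve["lead_days"] = lead.dt.days

# Total reserved visitors per visit date, for the time series plot.
by_date = reserve.groupby(reserve["visit_datetime"].dt.date)["reserve_visitors"].sum()
print(reserve[["lead_hours", "lead_days"]])
```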

Findings:

Plot 1: Total Reservations by Date

These are the 5 longest differences (in days) between reservation and visit dates. They involve only 2 unique AIR store IDs, so these are either very in-demand locations where this type of reservation is standard, or input errors with incorrect years.
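A table like this can be produced with `nlargest`, followed by a count of distinct stores. The sketch assumes a precomputed `lead_days` column, and the values are made up rather than the notebook's actual results:

```python
import pandas as pd

# Made-up lead times in days; store ids and values are for illustration only.
reserve = pd.DataFrame({
    "air_store_id": ["s1", "s2", "s1", "s3", "s2", "s1", "s4"],
    "lead_days": [393, 364, 365, 3, 362, 370, 1],
})

# The 5 reservations with the longest gap between booking and visit.
top5 = reserve.nlargest(5, "lead_days")

# How many distinct stores account for those 5 rows.
n_stores = top5["air_store_id"].nunique()
print(top5)
print("unique stores:", n_stores)
```

Gaps of roughly a full year are worth inspecting by hand, since a mistyped year would produce exactly that kind of value.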

We will now shift gears to the HPG data, a dataset similar to the AIR data but from a different reservation app. We will produce the same charts as above for:

Findings:

As these are the same plots, I will only note the general differences compared to the AIR data versions.

Let's take a look at a heatmap indicating where these restaurant locations are in Japan.

To get a more numerical view, let's plot the count of restaurants by area and the count by category. We'll stick to the top 15 if the lists become too cumbersome to view.
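The counts themselves are a `value_counts` call truncated to the top 15. The column names (`air_area_name`, `air_genre_name`) and the sample rows are assumptions for illustration:

```python
import pandas as pd

# Hypothetical stand-in for the store info table; rows are made up.
stores = pd.DataFrame({
    "air_area_name": [
        "Tōkyō-to Shibuya-ku Shibuya",
        "Tōkyō-to Shibuya-ku Shibuya",
        "Tōkyō-to Shibuya-ku Shibuya",
        "Ōsaka-fu Ōsaka-shi Ōgimachi",
        "Ōsaka-fu Ōsaka-shi Ōgimachi",
        "Fukuoka-ken Fukuoka-shi Daimyō",
    ],
    "air_genre_name": [
        "Izakaya", "Izakaya", "Cafe/Sweets",
        "Izakaya", "Cafe/Sweets", "Italian/French",
    ],
})

# value_counts sorts descending, so head(15) keeps the 15 most common values.
top_areas = stores["air_area_name"].value_counts().head(15)
top_genres = stores["air_genre_name"].value_counts().head(15)
print(top_areas)
print(top_genres)
```

Either result can be passed to `.plot(kind="barh")` for a horizontal bar chart.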

Findings:

Findings:

Looking at the adjusted plots, which consider only the top-level prefecture, Tokyo is by far the most represented area, with Osaka and Fukuoka rounding out the top 3 by number of restaurants.
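Reducing an area name to its prefecture can be sketched as taking the first space-separated token. This assumes the area strings start with the prefecture (for example "Tōkyō-to ..."); the column name and sample rows are assumptions for illustration:

```python
import pandas as pd

# Hypothetical area names; the format "<prefecture> <city/ward> <district>" is assumed.
stores = pd.DataFrame({
    "air_area_name": [
        "Tōkyō-to Shibuya-ku Shibuya",
        "Tōkyō-to Minato-ku Shibakōen",
        "Tōkyō-to Chiyoda-ku Kudanminami",
        "Ōsaka-fu Ōsaka-shi Ōgimachi",
        "Ōsaka-fu Ōsaka-shi Kyūtarōmachi",
        "Fukuoka-ken Fukuoka-shi Daimyō",
    ],
})

# The first token of the area name is treated as the prefecture.
stores["prefecture"] = stores["air_area_name"].str.split(" ").str[0]

# Restaurants per prefecture, most common first.
by_prefecture = stores["prefecture"].value_counts()
print(by_prefecture)
```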

We now shift focus to a top-level look at the holidays in Japan from the provided holidays dataset. This will provide some context on the total number of holidays and their potential impact on reservations/visitors. We can also look at where these holidays fall within the date range of the provided data.

We find that the same days were holidays in 2016 and 2017 in late April and May. Additionally, almost 7% of the dates listed are holidays.
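The holiday share is the mean of a 0/1 flag column. The sketch below assumes `calendar_date` and `holiday_flg` columns and uses a short two-week window around Golden Week 2016, so its share is far higher than the figure for the full date range:

```python
import pandas as pd

# Two-week calendar around Golden Week 2016 (a small illustrative window).
holidays = pd.DataFrame({
    "calendar_date": pd.date_range("2016-04-25", periods=14, freq="D"),
})

# Japanese public holidays that fall inside this window.
golden_week = pd.to_datetime(["2016-04-29", "2016-05-03", "2016-05-04", "2016-05-05"])
holidays["holiday_flg"] = holidays["calendar_date"].isin(golden_week).astype(int)

# Mean of a 0/1 flag is the fraction of dates that are holidays.
share = holidays["holiday_flg"].mean()
print(f"{share:.1%} of dates in this window are holidays")
```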

We can see that the test dataset, in red, starts around the last week of April 2017 and runs up to June 2017. Our training set has more than a full year of data to work with.

We can start to combine features and analyze their relationships, which in turn need to be interpreted in the context of the individual features' distributions, similar to what we did above for some features.

We'll start with average AIR restaurant visitors in relation to the type of cuisine (genre).

Findings:

We can see from these plots that there is some semblance of a weekly pattern, which we also saw in earlier plots of visitor data. With this in mind, we can take a look at the mean values by day of week and genre.
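Mean visitors by day of week and genre can be sketched as a merge of the visit data with the store info, followed by a pivot table. Column names and sample values are assumptions for illustration:

```python
import pandas as pd

# Hypothetical visit data; 2016-05-06 and 2016-05-13 are Fridays, 2016-05-07 a Saturday.
visits = pd.DataFrame({
    "air_store_id": ["a", "a", "a", "b", "b"],
    "visit_date": pd.to_datetime(
        ["2016-05-06", "2016-05-07", "2016-05-13", "2016-05-06", "2016-05-07"]
    ),
    "visitors": [10, 30, 20, 20, 40],
})
stores = pd.DataFrame({
    "air_store_id": ["a", "b"],
    "air_genre_name": ["Izakaya", "Cafe/Sweets"],
})

# Attach each store's genre to its visits, then derive the weekday name.
merged = visits.merge(stores, on="air_store_id", how="left")
merged["day_of_week"] = merged["visit_date"].dt.day_name()

# Rows are genres, columns are weekdays, cells are mean visitors.
mean_by_dow_genre = merged.pivot_table(
    index="air_genre_name", columns="day_of_week", values="visitors", aggfunc="mean"
)
print(mean_by_dow_genre)
```

The resulting table is the natural input for a heatmap or a grouped bar chart.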

Findings:

We can take a deeper look in relation to the holiday data we previously saw. Let's see the impact of holidays on average visitors overall, rather than by category for now:

Based on the first plot, I don't see much of a shift in average visitors between holidays and non-holidays. To see this with a bit more clarity, the second plot shows the breakdown by day of week. Holidays that land on weekdays (Mon-Fri) have a higher average visitor count, while on the weekend there is a negative impact on Saturday and only a slight positive impact on Sunday.
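The two comparisons (overall, then split by day of week) can be sketched as group-by means over a holiday flag. Column names are assumptions and the visitor counts are made up, so the numbers only illustrate the mechanics, not the findings above:

```python
import pandas as pd

# Hypothetical visits: two Fridays and two Tuesdays, one of each a holiday.
visits = pd.DataFrame({
    "visit_date": pd.to_datetime(
        ["2016-04-29", "2016-05-06", "2016-05-03", "2016-05-10"]
    ),
    "visitors": [30, 20, 25, 15],
})
holidays = pd.DataFrame({
    "calendar_date": pd.to_datetime(
        ["2016-04-29", "2016-05-06", "2016-05-03", "2016-05-10"]
    ),
    "holiday_flg": [1, 0, 1, 0],
})

# Attach the holiday flag to each visit date.
merged = visits.merge(
    holidays, left_on="visit_date", right_on="calendar_date", how="left"
)
merged["day_of_week"] = merged["visit_date"].dt.day_name()

# Overall mean visitors, holiday vs. not.
overall = merged.groupby("holiday_flg")["visitors"].mean()

# Same comparison per weekday; columns 0 and 1 are the holiday flag values.
by_dow = merged.groupby(["day_of_week", "holiday_flg"])["visitors"].mean().unstack("holiday_flg")
by_dow["holiday_minus_regular"] = by_dow[1] - by_dow[0]
print(overall)
print(by_dow)
```

A positive `holiday_minus_regular` for a weekday means holidays on that day draw more visitors on average than regular days.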